Early Experiments on Prosody in Synthetic Speech
Abstract
Synthetic speech needs prosody to convey the right structure and to sound natural. The emerging speech technology therefore drove the development of prosody models. Today, prosody research is well established, with its own conference series, and powerful tools are available for investigating prosodic effects. The 80th birthday of Professor Hiroya Fujisaki, the pioneer of quantitative prosody modeling, is an excellent occasion to look back at the situation in the earlier days of speech technology. The authors give an outline based mainly on material available from the history in Dresden and Berlin. The oral presentation will be accompanied by numerous historic audio examples.

1 The pre-electronic era

It is interesting to note that Wolfgang von Kempelen, the forefather of modern speech synthesis, recognized the importance of speech melody for his speaking machine: “Ich habe oft nachgedacht, ob man nicht [...] dahin kommen könnte [...], dieses Fallen und Steigen des Tones nach Willkühr zu bewirken und dadurch [...] wenigstens eine Abwechslung der Stimme bey dem Sprechen zu erhalten, welches meiner Maschine, die dermalen alles in einem Tone fortspricht, erst die rechte Annehmlichkeit geben würde.” [1, p. 413]. (In English, roughly: “I have often wondered whether one might not [...] manage [...] to produce this falling and rising of the tone at will and thereby [...] obtain at least some variation of the voice in speaking, which would give my machine, which at present speaks everything in one tone, its true pleasantness.”) He describes first attempts with manual control.

Figure 1 – Pre-electronic pitch recording. Left: Arrangement for recording the “throat sound” with a kymograph. Photograph from the historic acoustic-phonetic collection (HAPS) of the TU Dresden. Right: Pitch contour (‘la diga’) produced by interpreting a kymographic recording [4].

One hundred years later, experimental phonetics took a special interest in measuring the pitch contour, one of the most important physical phenomena of prosody, because many foreign languages (the “colonial languages”) had to be investigated. The analysis was performed mainly by interpreting the recordings of kymographs or phonographs (Figure 1).
This very complicated and time-consuming process relied on a number of tools which we have described in [2, 3]. Of course, there was no way to verify the results by re-synthesis.

2 Analysis-by-synthesis: the vocoder

2.1 Development of the technology of channel vocoders

At the beginning of the electronic era there were various attempts at speech synthesis. The real breakthrough came with the invention of the channel vocoder by K. O. Schmidt [5] and H. W. Dudley [6]. Dividing the device into an analyzer and a synthesizer made a very effective analysis-by-synthesis process possible [7]. The separate channel for the fundamental frequency made it possible to demonstrate the effect of pitch manipulation and thus to investigate prosodic contours experimentally. Some sound examples from the Dresden vocoder (Figure 2), which was developed by E. Krocker [8], are still available.

Figure 2 – Historic vocoders in Germany. Left: The Siemens vocoder in the background of the Siemens Studio for Electronic Music, now in the Deutsches Museum in Munich. Right: Historic photograph of the Dresden vocoder. The left rack contained the analyzer channels, the two right racks the synthesizer channels [8].

2.2 The experiments of Isačenko and Schädlich

The analysis-by-synthesis activities in speech prosody go back to vocoder experiments. The linguists A. V. Isačenko (1910-1977, a well-known Slavist) and H.-J. Schädlich (born 1935, later known as a novelist) were among the first to develop models for the quantitative description of prosodic effects [9]. The English translation of their report [10] includes a disk with some of the test sentences. This test material consists of German sentences whose fundamental frequency was manipulated to take only two values (examples are given in [9]). Experiments showed that enough prosodic information remains to recognize the correct grammatical structure of the sentences.
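The two-valued fundamental frequency described above can be illustrated with a minimal Python sketch. The text does not say how the two values were chosen on the vocoder, so the median threshold, the function name `two_level_f0`, the convention of marking unvoiced frames with 0, and the sample frequencies are all assumptions made for this example.

```python
from statistics import median


def two_level_f0(f0_hz, low_hz=None, high_hz=None):
    """Quantize a frame-wise F0 contour to two values (hypothetical helper).

    Frames with a value <= 0 are treated as unvoiced and set to 0.
    Voiced frames above the median of the voiced frames get the high value,
    all other voiced frames get the low value.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return [0 for _ in f0_hz]
    threshold = median(voiced)  # assumed decision rule, not from the source
    low = min(voiced) if low_hz is None else low_hz
    high = max(voiced) if high_hz is None else high_hz
    return [0 if f <= 0 else (high if f > threshold else low) for f in f0_hz]


# Invented example contour in Hz; 0 marks unvoiced frames.
contour = [0, 110, 118, 131, 140, 0, 125, 112, 104, 0]
print(two_level_f0(contour, low_hz=100, high_hz=130))
# prints [0, 100, 100, 130, 130, 0, 130, 100, 100, 0]
```

In a vocoder setting, a contour reduced in this way would replace the signal in the fundamental frequency channel while the spectral channels stay unchanged.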
The fundamental frequency manipulation was performed using the Dresden vocoder with the support of W. Tscheschner, and later using the Ericsson vocoder with the support of G. Fant.

3 Prosodic experiments with formant synthesizers

3.1 Development of formant synthesis

The first channel vocoders were large and expensive. There was some doubt whether they could be widely used in commercial applications. Also, the speech signal had an “inhuman” quality and limited intelligibility. It became clear that there are more effective ways of parameterizing the speech signal, and vocoder types other than the channel vocoder arose. Formant coding proved to be a very effective approach. Consequently, the early speech synthesis terminals also followed the principle of formant synthesis. This development was strongly influenced by the work of G. Fant and can be illustrated by the history at different places. We have described this path of early speech synthesis, especially at the TU Dresden under the guidance of W. Tscheschner (1927-2004), in [11].

The prosodic investigations described in the following section are connected to the ROSY project of the 1970s. ROSY was a four-formant speech synthesizer controlled by a process computer. A small series of the synthesizers was produced by the Dresden computer company Robotron, from which the device takes its name (RObotron SYnthesizer). Formant synthesizers are very well suited to prosodic experiments (and even to singing) because they have a separate excitation generator with controllable pitch.

Figure 3 – Prosodic experiments with ROSY 4200. Left: Experimental setup with the synthesizer terminal ROSY (middle right) and the contour generator (above). The control computer is not shown. Right: Models of suprasegmental fundamental frequency contours from [13].

3.2 Prosodic investigations

The prosody research for the speech synthesizers of the TU Dresden was performed in close cooperation with the Humboldt University in Berlin.
It can be divided into two phases. In the first, the microintonation at the sound transitions of German was investigated using natural speech material. Different types of transitions were classified, and a group of five was finally proposed for use in speech synthesis [12]. They were implemented in the hardware of the ROSY synthesizer. In the second phase, analysis-by-synthesis experiments on German macrointonation were performed with synthetic speech [13]. For this purpose, the synthesis terminal ROSY was complemented by a contour generator, which allowed the intonation of the synthesizer to be influenced in hardware. Based on listening experiments, a number of standard contours could be proposed for speech synthesis (Figure 3). Some examples of the test sentences in different intonation versions (monotonous / linear declination / declination plus accentuation) are still available as audio files.

4 Prosody in concatenative speech synthesis

4.1 Concatenation of waveforms in time domain

The idea of synthesizing natural-sounding speech by concatenating segments from a database of real speech is not really new. It emerged, as so-called concatenative synthesis, with the invention of magnetic storage of audio signals. Single sounds that had been spoken naturally could be stored and re-ordered into a new sequence. The synthesizer “Lora” is an early example. It consisted of a stacked series of storage elements like the one in Figure 4. Each element was equipped with a piece of magnetic tape storing a particular sound. All elements were arranged in parallel, and the selection of the proper element was controlled in a complicated way using a camshaft. The main problem, however, was producing natural-sounding transitions between sounds. It is reported that the transitions were implemented using the schwa as an intermediate sound [14].
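As a rough illustration of the three intonation versions named in Section 3.2 (monotonous, linear declination, declination plus accentuation), the following Python sketch generates frame-wise fundamental frequency contours. It is not a model of the ROSY contour generator, which worked in hardware. The function name `f0_contour`, the start and end frequencies, and the triangular accent shape are assumptions chosen only for this example.

```python
def f0_contour(n_frames, version, start_hz=130.0, end_hz=90.0,
               accents=(), accent_hz=25.0, accent_width=5):
    """Return n_frames F0 values in Hz for one intonation version.

    version: "monotonous", "declination" or "declination+accent".
    accents: frame indices of accent peaks (used by the third version only).
    All default values are illustrative assumptions, not measured data.
    """
    versions = ("monotonous", "declination", "declination+accent")
    if version not in versions:
        raise ValueError(f"version must be one of {versions}")
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")

    contour = []
    for i in range(n_frames):
        if version == "monotonous":
            f0 = start_hz  # constant pitch over the whole utterance
        else:
            # Linear declination from start_hz down to end_hz.
            f0 = start_hz + (end_hz - start_hz) * i / (n_frames - 1)
        if version == "declination+accent":
            # Add a triangular bump of height accent_hz around each accent.
            for centre in accents:
                distance = abs(i - centre)
                if distance < accent_width:
                    f0 += accent_hz * (1 - distance / accent_width)
        contour.append(f0)
    return contour


for name in ("monotonous", "declination", "declination+accent"):
    values = f0_contour(11, name, accents=(5,))
    print(name, [round(v, 1) for v in values])
```

Keeping the declination line and the accent bumps as separate additive terms makes it easy to listen to each component in isolation, which mirrors the comparison of versions described in the text.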
Publication date: 2010